Abstract

It is shown that a real novel shares many characteristic features with a null model in which 
the words are randomly distributed throughout the text. Such a common feature is a certain 
translational invariance of the text. Another is that the functional form of the word-frequency
distribution of a novel depends on the length of the text in the same way as the null model. This 
means that an approximate power-law tail ascribed to the data will have an exponent which changes 
with the size of the text-section which is analyzed. A further consequence is that a novel cannot 
be described by text-evolution models like the Simon model. The size-transformation of a novel is 
found to be well described by a specific Random Book Transformation. This size transformation 
in addition enables a more precise determination of the functional form of the word-frequency 
distribution. The implications of the results are discussed. 
I. INTRODUCTION



Some 75 years ago Zipf found that the word frequency of a language has a very particular
"power-law like" distribution. This phenomenon is best known as Zipf's law and states
that the number of occurrences of a word in a long enough written text falls off as 1/r, where
r is the occurrence rank of the word (the smaller the rank, the more occurrences).
How well is this power law obeyed? What is its origin? What does it imply from a linguistic
and cognitive point of view, if anything?

Simon in Ref. [6] emphasized that the fact that "power law" distributions occur in a wide
range of seemingly unrelated phenomena suggests a general underlying stochastic nature.
In particular he devised a general stochastic model for the writing of a text, the Simon
model [6]. The random element in this model is tied to the actual process of evolving the
text and not to a property of the language itself. The Simon model and its stochastic
evolution mechanism have, since their first appearance, turned up in many disguises such as
rich-get-richer models and preferential attachment [jj]. An alternative view was taken by 
Mandelbrot, who proposed that Zipf's law of word frequencies could be associated with the
collective language itself rather than with the evolution of a particular text [8]. In particular
he proposed that the "power-law like" distribution could be 
a letter-combination information However, Miller in Ref. 
distribution of words in a collective language does not per se require any optimization, which
gave rise to the metaphor of a monkey randomly writing on a typewriter. All these
proposed explanations presume that the "power-law like" distribution says nothing about
the syntax, grammar and context correlations of a written text. Yet the word correlations
are, of course, essential for the meaning of a text.

In the present paper we focus on the function $W_D(k)$, the number of distinct words which
occur precisely $k$ times in a written text. For this quantity, Zipf's word-rank power law
corresponds to $W_D(k) \sim 1/k^2$ [6]. We here focus on the properties of single novels, each
novel written by a single author. In this way we ensure that both the evolution aspect of
the text and the properties of the language always relate to the very same text. From this
perspective a novel can perhaps be regarded as a fingerprint of the author's brain [11]. We
demonstrate that the text of a novel displays certain general features and show that these
features are shared with a simple null model which we call the random book.
In section II we describe some general characteristic features which the text of a novel
displays. For clarity we choose one typical novel as an illustrative example. In appendix A we
include data for a collection of novels in order to illustrate the generality of the conclusions.
In section III we discuss the random book transformation, which describes how the
word-frequency distribution changes with the length of the text analyzed. It is shown that a
real novel, to a good approximation, transforms in the same way. It is also shown that the
random book transformation can be used to obtain a sharper determination of the
word-frequency distribution of a novel. Section IV contains our summary and concluding
remarks.



II. BOOKISH FACTS 

Examples of key characteristics of the word frequencies in a novel are as follows. The
most obvious is the word-frequency distribution of the complete novel. A word is in this
context defined as a group of letters separated by blanks. If the book contains $W_D$ distinct
(different) words and a total of $W_T$ words, then $P(k) = W_D(k)/W_D$ is the probability that
a distinct word, picked at random from the book, occurs $k$ times in the book. This means
that $\sum_{k=1}^{k_{max}} P(k) = 1$, where $k_{max}$ is the maximum number of times a distinct word
appears in the book, and also that $\sum_{k=1}^{k_{max}} k P(k) = W_T/W_D = \langle k \rangle$, which is the
average number of times a word occurs in the book. The function $P(k)$ is the word-frequency
distribution and is often very broad and more or less "power-law like", $P(k) \sim 1/k^{\gamma}$ with
$\gamma < 2$, over a substantial region. This is illustrated in Fig. 1a with data for the novel
Howards End (HE in the following) by E. M. Forster taken from Ref. [12], where circles
correspond to the raw data. The horizontal distribution for the largest $k$-values means that
only single unique words have the largest numbers of occurrences. The triangles correspond to
a $\log_2$-binning (bin $i$ has a size of $2^{i-1}$) of the data, and one notes that these data follow
a smooth curve. This last fact implies that the data are produced by a stochastic process. The
functional form $P(k) \sim \exp(-bk)/k^{\gamma}$ gives a good fit to the data ($\gamma = 1.73$ in Fig. 1a).
The level of "goodness" of this fit is discussed in section III and shown in Fig. 4.

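The definitions above can be made concrete with a short sketch. It is only an illustration under stated assumptions: the function names `word_frequency_distribution` and `log2_binned` are hypothetical, words are split on whitespace as in the definition above, and each bin value is taken to be the mean of $P(k)$ over the bin, which is one possible reading of the $\log_2$-binning.

```python
from collections import Counter

def word_frequency_distribution(text):
    """Return (P, W_T, W_D), where P[k] = W_D(k) / W_D."""
    words = text.split()                  # a word = letters separated by blanks
    counts = Counter(words)               # occurrences of each distinct word
    W_T, W_D = len(words), len(counts)
    classes = Counter(counts.values())    # W_D(k): distinct words occurring k times
    P = {k: n_k / W_D for k, n_k in classes.items()}
    return P, W_T, W_D

def log2_binned(P):
    """Mean of P(k) over bins of size 2^(i-1), starting at k = 2^(i-1) (assumed convention)."""
    k_max = max(P)
    binned = []
    i = 1
    while 2 ** (i - 1) <= k_max:
        lo = size = 2 ** (i - 1)
        total = sum(P.get(k, 0.0) for k in range(lo, lo + size))
        binned.append((lo, total / size))  # (left bin edge, mean P in the bin)
        i += 1
    return binned

sample = "the cat saw the dog and the dog saw a bird"
P, W_T, W_D = word_frequency_distribution(sample)
mean_k = sum(k * p for k, p in P.items())  # equals W_T / W_D
```

For the toy sentence, `mean_k` equals `W_T / W_D`, which is the normalization identity stated in the text.
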
Instead of analyzing the complete book, one can analyze a section containing a total of
$w_T$ words. Then one finds that the "power-law slope" of the corresponding word-frequency
distribution, $P_{w_T}(k)$, in a novel depends on the total number of words $w_T$. This is illustrated
in Fig. 1b, which shows the average word-frequency distribution for $n^{th}$-parts of Howards
End. The total number of words is $W_T \approx 110000$, which means that the $n = 20$-part shown
in Fig. 1b only has $w_T = W_T/n \approx 5500$ words, while the $n = 200$-part corresponds to
$w_T \approx 550$ words. The word-frequency distribution for a section of size $w_T$ is obtained as an
average over a large number of sections of the same size, and we use periodic boundary
conditions in order to avoid reduced statistics due to the boundaries of the book. As will be
shown below, real books display a strong tendency towards having the words close to randomly
distributed, which justifies the use of periodic boundary conditions. As seen in Fig. 1b the
slope of the "power-law like" part of the distribution gets systematically steeper when taking
smaller and smaller sections of the book. From a practical point of view this means that if you
attempt to approximate the word-frequency distribution with the function
$P_{w_T}(k) \sim \exp(-bk)/k^{\gamma}$ then the exponent $\gamma$ increases as $w_T$ decreases. The change of
the shape of $P_{w_T}(k)$ as a function of the total number of words $w_T$ is a characteristic feature
of the word frequency in a book.
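The sectioning procedure can be sketched as follows. This is a minimal illustration, not the authors' code: the name `section_distribution` and the `step` parameter are hypothetical, and the average is taken over the per-section distributions, which is an assumption about how the averaging is done.

```python
from collections import Counter

def section_distribution(words, w_T, step=1):
    """Average P_{w_T}(k) over sections of w_T consecutive words.

    Sections start at every `step`-th position and wrap around the end of
    the text (periodic boundary conditions).
    """
    W_T = len(words)
    sums = Counter()
    n_sections = 0
    for start in range(0, W_T, step):
        section = [words[(start + j) % W_T] for j in range(w_T)]
        counts = Counter(section)
        w_D = len(counts)                       # distinct words in this section
        for k, n_k in Counter(counts.values()).items():
            sums[k] += n_k / w_D                # this section's P(k)
        n_sections += 1
    return {k: s / n_sections for k, s in sorted(sums.items())}

words = "a b a c a b d a".split()
P_half = section_distribution(words, 4)         # sections of half the text
P_full = section_distribution(words, len(words))
```

With `w_T` equal to the full length every section is a rotation of the whole text, so `P_full` reproduces the distribution of the complete text.
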

Fig. 2a shows the number of distinct words $w_D(w_T)$ as a function of the total number
of words $w_T$. The first word is always distinct, which means that $w_D(w_T = 1) = 1$. As
you go further into the book, words tend to be repeated, which means that the number of
distinct words increases more slowly than a straight line with slope 1. The shape of $w_D(w_T)$
is characteristic of the novel since it reflects the spatial distribution of words within the
novel. Note that the function $w_D(w_T)$ and the distributions $P_{w_T}(k)$ are directly related,
since the average number of times a distinct word appears is
$\langle k \rangle_{w_T} = \sum_{k=1}^{k_{max}} k P_{w_T}(k) = w_T/w_D(w_T)$. How would $w_D(w_T)$ change if the
words were completely randomly distributed in the book, keeping the same frequency
distribution? As seen from Fig. 2a, the function for the randomized book (where all words are
placed randomly in the book) is very close to the raw data of the novel. A characteristic
feature of a novel is that the function $w_D(w_T)$ is close to the one for the random null model
of the novel. This implies that the real novel and the null model share some overall random
features.
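The comparison between a text and its randomized version can be sketched in a few lines. The function names are hypothetical, and the null model is realized here as a single random permutation of the words, which keeps the frequency distribution unchanged.

```python
import random

def distinct_word_curve(words):
    """Return [w_D(1), w_D(2), ..., w_D(W_T)] for the given word sequence."""
    seen, curve = set(), []
    for w in words:
        seen.add(w)
        curve.append(len(seen))
    return curve

def randomized_curve(words, seed=0):
    """Same curve for one random permutation of the words (the null model)."""
    shuffled = list(words)                 # copy, so the input is left untouched
    random.Random(seed).shuffle(shuffled)
    return distinct_word_curve(shuffled)

words = "a b a c a b d a".split()
real = distinct_word_curve(words)
null = randomized_curve(words)
```

Both curves start at 1 and end at $W_D$; only the path in between can differ.
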

The random features are also reflected in the distribution of words belonging to different
frequency classes: the frequency class $k$ contains all words which appear precisely $k$ times in
the book. For example the class $k = 1$ contains all the words which only occur once in the
book. Random with respect to frequency classes means that there is no preference for words
belonging to a specific frequency class to appear in any particular part of the book. Thus for
a random book you should have encountered close to half the words belonging to a frequency
class when you have read precisely half of the book. Fig. 2b shows the percentage of words
belonging to a frequency class $k$ encountered after reading half a real book as a function of
$k$. The data are for the real HE and the full drawn horizontal line is the expectation value
for a randomized HE. The grey shadings mark one and two standard deviations (using the
same binning as for the real novel) away from the randomized HE. This means that if the
data circles in Fig. 2b had belonged to a single realization of a random HE book then they
would with large probability fall inside the grey areas. The actual circles in Fig. 2b give the
data for the real novel HE. These data follow the same horizontal trend and are compatible
with the random null model over a substantial region of $k$ values. However, a real novel is
of course a highly purposefully structured creation. Some noticeable deviations in Fig. 2b can
immediately be associated with such constructive features. The first noticeable deviation in
Fig. 2b is that the value for the frequency class $k = 1$ (words which only occur once in the
book) is only 47% (an average over the collection of books in Appendix A gives 47.3%),
which is a statistically significant deviation from 50%. The reason is that an author who
writes a book from the beginning to the end will have a slightly decreasing tendency of
introducing new rare words towards the end of the book. Another noticeable deviation is
the two circles higher than 50% for larger $k$ (words occurring very often in the book). These
deviations are actually caused by the two specific words "she" and "her" and are clearly
context-related features of the novel (a particular context in chapter four, about a third into
the book, has a very low concentration of "she" and "her"). Nevertheless Fig. 2b illustrates
that the overall tendency of the data has the same characteristic feature as the null model.
For the Simon model, the distribution of words belonging to different frequency classes is
incompatible with the random feature displayed by real novels: the triangles in Fig. 2b
represent the data for a single Simon book of the same length and $\langle k \rangle$ as HE. The dashed
lines give the analytic asymptotic behavior in the small and large $k$ limits (see Appendix B).
It is clear that rare words tend to appear very late in the Simon-book while common words are
more densely positioned early in the book. As explained below, this is because the Simon model
is a growth model.
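The half-book test for frequency classes can be sketched as below. The function name `half_book_fractions` is hypothetical, and the fraction for class $k$ is taken to be the number of occurrences of class-$k$ words in the first half divided by all $k W_D(k)$ occurrences of that class, which is an assumed reading of the quantity plotted in Fig. 2b.

```python
from collections import Counter, defaultdict

def half_book_fractions(words):
    """Fraction of the occurrences in each frequency class found in the first half."""
    counts = Counter(words)                # k for every distinct word
    half = len(words) // 2
    seen = defaultdict(int)
    for w in words[:half]:
        seen[counts[w]] += 1               # one more occurrence from class k
    total = Counter()
    for k in counts.values():
        total[k] += k                      # class k holds k * W_D(k) occurrences
    return {k: seen[k] / total[k] for k in sorted(total)}

fractions = half_book_fractions("a b a c a b d a".split())
```

For a randomized book every value is expected to be close to 0.5; systematic deviations as a function of $k$ signal structure such as the growth trend of the Simon model.
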

Another characteristic feature of the null model is that the text is translationally
invariant. This means that if you divide the novel into three consecutive sections and obtain
the functions $w_D(w_T)$ separately for all three sections, then these three functions show no
systematic trend in their deviations. Fig. 2c demonstrates that the same is, to a very good
approximation, also true for the real novel HE. Appendix A gives data from a variety of novels
suggesting that the qualitative agreements between the random null model and real novels
given by Figs. 2a and c are indeed general features. Real books contain information in the
form of a story. Different parts describe different events and surroundings, which may create
word correlations. So we should expect some fluctuations between curves for different parts
of a novel. But the point is that, in general, no systematic change can be observed between
parts of a real novel. The translational invariance of the text is a characteristic feature of a
novel.

Whereas a real novel is in qualitative agreement with the null model, the Simon model
is instead incompatible: Fig. 2d shows that the Simon model does not obey translational
invariance, but instead displays a strong systematic trend. The data are obtained by generating
books of the same length as HE using the stochastic growth model by Simon [6]. The books
are divided into three consecutive parts of equal size and the average distributions for these
three parts are plotted in the figure. As seen, the distributions change systematically with
the position in the book in a way that is incompatible with translational invariance.
This is contrary to the data for a real book (compare Fig. 2c). So what is wrong with the
Simon model in the context of real novels? The problem can be traced to the stochastic
element (the dice) in the model. The basic version of the Simon model goes as follows [6]:
The novel is assumed to be written by adding words in consecutive order from the start to
the end. Each time the author adds a word to the text it can either be a word not previously
used in the text or an old word. There is a certain chance to add a new word and a certain
chance to use an old one. The crucial stochastic assumption in the model is that the chance
of picking a specific old word is directly proportional to the number of times this word has
already been used in the text written so far. Thus the randomness in the Simon model is
associated with picking words randomly from the text already written. As this text evolves,
the reservoir (the text written so far) from which the random words are picked also changes.
Hence the random element in Simon-type models explicitly depends on the growth process of
the text. It means that the stochastic element changes with the position in the book. This is
in contrast to the random null model, where the randomness is independent of the position
in the book. One may also note that the resulting word-frequency distribution, $P(k)$, for
the Simon model with a constant growth rate is independent of the length of the text. This
is in contrast to a real novel, where the shape of the distribution changes with the length
of the text (compare Fig. 1b). The crucial point is that stochastic text evolution models in
general have the same problem as the Simon model, including all preferential attachment
type models [k| Q Q Q. Growth processes which are based on a stochastic element ( a 
dice) which ipso facto depends on the position in the text do not adequately reproduce the
statistical distribution of words in a text. We emphasize that this is a fundamental structural
feature which cannot be remedied within this class of stochastic models. This implies that
the stochastic element in real novels belongs to an altogether different stochastic class.
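The growth rule described above can be simulated directly. The sketch below is an illustration under assumptions: the function name `simon_book`, the integer labels used for words, and the parameter values are hypothetical, and new words are added with a constant probability `alpha`. Copying a uniformly chosen earlier position realizes the rule that an old word is picked with a chance proportional to its current number of occurrences.

```python
import random
from collections import Counter

def simon_book(T, alpha, seed=0):
    """Generate a Simon-model text of T words; words are labelled by integers."""
    rng = random.Random(seed)
    book = [0]                              # the first word is always new
    n_distinct = 1
    for _ in range(T - 1):
        if rng.random() < alpha:
            book.append(n_distinct)         # introduce a new word
            n_distinct += 1
        else:
            book.append(rng.choice(book))   # copy a uniformly chosen earlier occurrence
    return book

book = simon_book(20000, 0.1)
counts = Counter(book)
singletons = [t for t, w in enumerate(book) if counts[w] == 1]   # positions of k = 1 words
frac_first_half = sum(t < len(book) // 2 for t in singletons) / len(singletons)
```

In such a run the fraction of $k = 1$ words found in the first half comes out well below 50%, which is the lack of translational invariance discussed in the text.
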

A noteworthy additional characteristic feature is that the word-frequency distribution
$P_{w_T}(k)$ for an author depends, to a large extent, only on the number of words $w_T$ written
by the author and not on the specific book or short story. This is illustrated in Fig. 3
by comparing a short story by D. H. Lawrence (The Prussian Officer (PO), $W_T \approx 9000$)
with book sections of the corresponding size from two of his full novels. Fig. 3a is for
Women in Love (WL), which has $W_T \approx 180000$, and Fig. 3b is for Sons and Lovers (SL), which
has $W_T \approx 162000$. As in the case of Fig. 1b, the word-frequency distribution for a section is
the average over many sections of the same size. In order to obtain a section size of the same
length as the short story we use $n = 20$-parts in a) and $n = 18$-parts in b). The agreement
is very good in both cases except for the data of the very highest $k$-values. This difference is
an artifact of comparing a snapshot (PO) with a curve resulting from averages (sectioning
of WL and SL).



III. THE RANDOM BOOK TRANSFORMATION 



We now return to the characteristic size dependence of the word-frequency distribution
$P_{w_T}(k)$ for a novel described in Fig. 1b. Fig. 4a compares this size dependence with the
corresponding size dependence of the random null model: first we extract, directly from
the raw data, the $P_{w_T}(k)$ corresponding to $n = 200$-parts of the novel HE. These
data are represented by squares in Fig. 4a. Next we randomize the words in HE. Note that
a randomization leaves the frequency distribution $P(k)$ invariant. From a sample of the
randomized HE-book we extract $P_{w_T}(k)$ corresponding to $n = 200$-parts of the randomized
HE. This is given by the triangles in Fig. 4a. The overlap of the data is near perfect,
indicating that the null model transforms in very much the same way as the real novel. In
the case of the random null model one can straightforwardly obtain the size transformation. The
starting point is the word-frequency distribution $P(k)$ for a book with $W_T$ total words and
$W_D$ different words. The question is how $P(k)$ relates to the word-frequency distribution,
$P_{w_T}(k)$, for a section of size $w_T < W_T$ of the very same book. For the case when the words
within a frequency class are randomly distributed the relation follows from combinatorics.
The probability for a word that appears $k'$ times in the full book to appear $k$ times in a
smaller section ($k' \geq k$) can be expressed in binomial coefficients. If we let $P(k)$ and
$P_{w_T}(k)$ be two column matrices with $W_D$ elements enumerated by $k$, then

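$$P_{w_T}(k) = C \sum_{k'=k}^{\infty} A_{kk'} P(k') \qquad (1)$$

where $A_{kk'}$ is the triangular matrix with the elements

$$A_{kk'} = (n-1)^{k'-k} \frac{1}{n^{k'}} \binom{k'}{k} \qquad (2)$$

with $n = W_T/w_T$, and $\binom{k'}{k}$ is the binomial coefficient. The coefficient $C$ is

$$C = \frac{1}{1 - \sum_{k'=1}^{\infty} \left(\frac{n-1}{n}\right)^{k'} P(k')} \qquad (3)$$

which ensures that $\sum_{k \geq 1} P_{w_T}(k) = 1$. Since $A_{kk'}$ is a triangular matrix with only
positive elements on and above the diagonal, it also has an inverse, which is given by

$$A^{-1}_{kk'} = (1-n)^{k'-k}\, n^{k} \binom{k'}{k} \qquad (4)$$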
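As an illustration, the transformation can be applied numerically as sketched below. The sketch assumes the binomial form $A_{kk'} = \binom{k'}{k} (1/n)^{k} (1 - 1/n)^{k'-k}$ for the matrix elements, with $n = W_T/w_T$, and fixes the coefficient $C$ by normalizing over $k \geq 1$. The function name `rbt` is hypothetical, and logarithms are used only to avoid overflow for large $k'$.

```python
from math import exp, lgamma, log

def rbt(P, n):
    """Transform P (dict k -> P(k)) of a full book to a 1/n section of it."""
    if n <= 1:
        raise ValueError("n must be larger than 1")
    log_p, log_q = -log(n), log(1.0 - 1.0 / n)      # log(1/n) and log((n-1)/n)
    unnorm = {}
    for k in range(1, max(P) + 1):
        s = 0.0
        for kp, p in P.items():
            if kp >= k and p > 0.0:
                log_binom = lgamma(kp + 1) - lgamma(k + 1) - lgamma(kp - k + 1)
                s += exp(log_binom + k * log_p + (kp - k) * log_q) * p
        if s > 0.0:
            unnorm[k] = s
    C = 1.0 / sum(unnorm.values())                   # normalization over k >= 1
    return {k: C * v for k, v in unnorm.items()}

P_full = {1: 0.5, 2: 0.25, 4: 0.25}
P_half = rbt(P_full, 2)                              # distribution for half the book
```

Words that do not appear in the section at all (the $k = 0$ class) are dropped, which is why the normalization constant is larger than one.
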

One should note that the RBT (Random Book Transformation) only hinges on the assumption
that words belonging to a frequency class are randomly distributed throughout the
book. Since this assumption is rather well obeyed by real novels (compare Fig. 2b), the near
perfect agreement between the randomized null model and the real HE in the case of the two
$n = 200$-parts shown in Fig. 4a may be interpreted as a confirmation that the real novel
and the randomized novel share some basic stochastic features.

In Fig. 4b we start from the randomized HE and section it into parts with $w_T$ words.
For each section size the average number of distinct words $w_D$ is determined so that one
obtains the quantity $1/\langle k \rangle_{w_T} = \frac{w_D}{w_T}(w_T)$. An average over many sections of the same
size is used. The result is the full drawn curve in Fig. 4b. One should note that this is in
fact not a curve but a very dense set of data points (each point corresponds to a different
section size, which means that the total number of points is $W_T \approx 110000$). In this way the
raw data for HE given by the circles in Fig. 1a are transformed into a very smooth curve for
$\frac{w_D}{w_T}(w_T)$. The Bayesian probabilistic assumption used is that words from different
word-frequency classes have no preferential order. As apparent from Fig. 2b and Fig. 4a this
is a very reasonable Bayesian assumption. The point is now that the function
$\frac{w_D}{w_T}(w_T)$, through the RBT, uniquely determines $P(k)$ and vice versa. In order to find the
corresponding $P(k)$ we have used a parametrized ansatz for $P(k)$ and determined the parameters
so as to reproduce the $\frac{w_D}{w_T}(w_T)$-data as well as possible. In Fig. 4b we have tested three
different parametrization forms. The first is a pure power law, $P_{W_T}(k) \sim 1/k^{\gamma}$ (short
dashed curve in Fig. 4b). Our conclusion is that a power law is incompatible with the data and
can be ruled out. The next try is a power law with an exponential cut-off,
$P_{W_T}(k) \sim \exp(-bk)/k^{\gamma}$. This form gives a very reasonable approximation of the data, and
the function representing the binned data in Fig. 1a corresponds to the long dashed curve in
Fig. 4b. But one can, of course, do a little bit better by adding another parameter. The
augmented power law with an exponential cut-off, $P_{W_T}(k) \sim \exp(-bk)/[(k+c)k^{\gamma-1}]$, gives
an even better fit to the data (open circles in Fig. 4b).

As a simple quantitative goodness measure, one can take the maximum absolute difference
between the real data and the data obtained from the various parametrizations: the values
for the power law, the power law with exponential cut-off and the augmented power law with
exponential cut-off are approximately 0.063, 0.022 and 0.008, respectively. In Fig. 4a we
have replotted the binned HE-data from Fig. 1a together with the best parametrization of
$P(k)$ obtained from the $\frac{w_D}{w_T}(w_T)$-data in Fig. 4b (circles and dashed curve, respectively).
The interesting point here is that our data analysis, which makes use of the RBT,
makes it possible to distinguish between parametrizations of $P(k)$ which would otherwise
be very hard to distinguish. This is illustrated in Fig. 4c, which directly compares the
augmented power law with exponential cut-off with the plain power law with exponential
cut-off. As seen from Fig. 4c, there is almost no discernible difference when $P(k)$ is
plotted on a log-log scale.
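The fitting step can be sketched as follows. Under the binomial random-book assumption, a word occurring $k$ times in the full book is present in a $1/n$ section with probability $1 - (1 - 1/n)^{k}$, which gives $w_D/w_T = n \sum_k P(k) [1 - (1 - 1/n)^{k}] / \langle k \rangle$. The function names, the truncation `k_max` and the parameter values below are hypothetical and are not the fitted values of the paper.

```python
from math import exp

def augmented_power_law(k_max, gamma, b, c):
    """Normalized P(k) ~ exp(-b k) / ((k + c) k^(gamma - 1)) for k = 1..k_max."""
    raw = [exp(-b * k) / ((k + c) * k ** (gamma - 1)) for k in range(1, k_max + 1)]
    Z = sum(raw)
    return [r / Z for r in raw]             # index 0 holds P(1)

def distinct_ratio(P, n):
    """w_D / w_T for a 1/n section of a random book with distribution P."""
    mean_k = sum(k * p for k, p in enumerate(P, start=1))
    found = sum(p * (1.0 - (1.0 - 1.0 / n) ** k) for k, p in enumerate(P, start=1))
    return n * found / mean_k

def max_abs_difference(curve_a, curve_b):
    """Goodness measure: largest absolute deviation between two sampled curves."""
    return max(abs(a - b) for a, b in zip(curve_a, curve_b))

P = augmented_power_law(1000, 1.7, 0.001, 0.5)
mean_k = sum(k * p for k, p in enumerate(P, start=1))
model_curve = [distinct_ratio(P, n) for n in (1, 2, 20, 200)]
```

Setting `c = 0` recovers the power law with exponential cut-off, so the same functions can be used to compare the two parametrizations against a measured $w_D/w_T$ curve.
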

A consequence of the RBT is that the functional form of $P(k)$ changes
with the length of the text. The full drawn curve in Fig. 4a gives $P(k)$ corresponding to
$n = 200$-parts of HE obtained from the parametrization of the form
$P(k) \sim \exp(-bk)/[(k+c)k^{\gamma-1}]$ determined from Fig. 4b. It agrees very well with the real data.

In Fig. 3 it was demonstrated that the word-frequency distribution associated with $n$-part
sections of a novel by an author also describes, to a good approximation, a shorter novel by
the same author, provided the shorter novel has the same length as the sections. One can then
extrapolate this idea and imagine that the longer novel can also be described as a section
of an even longer novel, and so on. This leads to the suggestion of a "meta book", a giant
single "mother book" which characterizes the word-frequency distribution of all the writings
of an author. An author would then, when writing a novel, be roughly pulling a section of
$w_T$ words from this "meta book", resulting in a word-frequency distribution $P_{w_T}(k)$. This is
the same as transforming down the "meta book" via the RBT to the size $w_T$. The "meta
book" concept will be further explored in a forthcoming paper.
IV. CONCLUSIONS



We have shown that the words belonging to a frequency class in a book have a tendency
to be randomly distributed throughout the text. This randomness is incompatible with
text growth models like the Simon model [6]. This is because these models are based on
a stochastic assumption of re-using words already written in the text. This is true for all
growth models, independent of the details of the growth mechanism. It was also shown
that the word-frequency distribution of a novel has a shape which systematically depends
on the size of the novel. This feature is also incompatible with the Simon model [6]. Instead
the properties of a novel were to a large extent found to be shared with a random null model.
The size transformation of this model is explicitly given by a Random Book Transformation
(RBT) and some consequences of this were explored. We speculate that the word-frequency
distribution is consistent with the concept of a "meta book" which characterizes the
word-frequency distribution of all the writings of an author.

Our findings about the statistical properties of the words in a novel seem to be general:
it does not matter much which author or book you pick, the overall properties are the same
(at least for the English novels we have so far analyzed). Thus they say something general
about the structure of the written language used by a single author. Since language in
general is a product of human evolution, it also means that the statistical properties
presumably reflect some evolutionary pressure.
VI. APPENDIX A: COLLECTION OF BOOKS

TABLE I: List of the books analyzed. $W_T$ is the total number of words in the book, $W_D$ is the
total number of different words in the book and $W_T/W_D$ is the average number of times a word is
used. The initials of the authors stand for: E.M F -> E.M. Forster. H M -> Herman Melville. G
O -> George Orwell. T H -> Thomas Hardy. D.H L -> D.H. Lawrence.



Author   Book (abbr)                     W_T       W_D    W_T/W_D
E.M F    Howards End (HE)            110,224     9,256      11.91
         The Longest Journey (LJ)     95,265     8,443      11.28
H M      White Jacket (WJ)           143,368    13,710      10.46
         Moby Dick (MD)              212,473    17,226      12.33
G O      1984                        104,393     8,983      11.62
T H      Jude the Obscure (JO)       146,557    10,896      13.45
D.H L    Women in Love (WL)          182,722    11,301      16.20
         Sons and Lovers (SL)        162,101     9,606      16.87
         The Prussian Officer (PO)     9,115     1,823       5.00


In order to verify the generality of our results and conclusions, a collection of eight books
(in addition to Howards End) was analyzed (see Table I). The Prussian Officer (PO) is
not a part of the analysis in Fig. 2 because of its small size. It is, however, a part of the
analysis in Fig. 3. In order to get a quantitative measure of how much the curves for the
three starting points, in Fig. 2c and d, differ we introduce two quantities, $\xi_{rms}$ and $\xi_a$,
given by the expressions
TABLE II: A list of the eight books analyzed plus the Simon-book and one randomized version of
HE, showing the values for $\xi_{rms}$ and $\xi_a$.

            Simon   HE_rand     HE     LJ     WJ     MD   1984     JO     WL     SL
$\xi_{rms}$    1207        33     68    176    122    185    215    212    172    349
$\xi_a$        1113       -13    -38    -43    -98    151   -153   -103   -157   -326


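$$\xi_{rms} = \left\langle \sqrt{ \frac{1}{W_T^i} \sum_{w_T=0}^{W_T^i} \left[ w_D^i(w_T) - w_D^j(w_T) \right]^2 } \right\rangle \qquad (5)$$

$$\xi_a = \left\langle \frac{1}{W_T^i} \sum_{w_T=0}^{W_T^i} \left[ w_D^i(w_T) - w_D^j(w_T) \right] \right\rangle \qquad (6)$$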

where $i$ and $j$ denote the parts of the book and $\langle \ldots \rangle$ is an average over all the
combinations of $i, j = 1, 2, 3$ where $i > j$. The length of each part is $W_T^i = 25000$. The first
equation gives an average root mean square distance between the curves. The second equation
gives the average difference between two curves representing one part and a later part of the
book. This means that if we have a trend that the curves for later parts in the book tend to
have larger values for the $w_D(w_T)$-curve, then $\xi_a$ will be a large positive number. If the
trend is that later parts have smaller values we will get a large negative number. And, if
there is no trend at all, we will get a value close to zero. Figure IVHI shows the curves for
the seven extra books and Table II shows the values of $\xi_{rms}$ and $\xi_a$. The Simon-book from
Fig. 2d and one randomized version of Howards End (HE$_{rand}$) are also included in Table II to
give two reference points.

When compared to the Simon-book, all the real books have small values of $\xi_{rms}$
and $\xi_a$, indicating a strong resemblance to the null model of the random book. The values
in the second row also show that there is no real trend among the real books, except
for SL, which has a small negative trend (compared to the Simon-book, which has a very
strong positive trend).
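A minimal sketch of the two measures is given below. It follows the verbal description above: an rms distance and a signed mean difference between the $w_D(w_T)$ curves of a later part $i$ and an earlier part $j$, averaged over all pairs with $i > j$. The function name `xi_measures` is hypothetical, and dividing by the number of sampled points is an assumption about the normalization.

```python
from itertools import combinations
from math import sqrt

def xi_measures(curves):
    """curves: equal-length w_D(w_T) curves, ordered by position in the book."""
    rms_terms, diff_terms = [], []
    for j, i in combinations(range(len(curves)), 2):    # j is earlier, i is later
        d = [a - b for a, b in zip(curves[i], curves[j])]
        rms_terms.append(sqrt(sum(x * x for x in d) / len(d)))
        diff_terms.append(sum(d) / len(d))
    xi_rms = sum(rms_terms) / len(rms_terms)
    xi_a = sum(diff_terms) / len(diff_terms)
    return xi_rms, xi_a

flat = xi_measures([[1, 2, 3], [1, 2, 3], [1, 2, 3]])       # no trend
rising = xi_measures([[1, 2, 3], [2, 3, 4], [3, 4, 5]])     # later parts lie higher
```

Identical curves give zero for both measures, while curves that rise with the position in the book give a positive signed difference, as described for the Simon-book.
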
VII. APPENDIX B: SIMON-MODEL



In the Simon model a word is written at every time step. With probability $\alpha$ a new
word, which has never been written in the book so far, is written. With probability $1 - \alpha$
an old word is rewritten, chosen uniformly among all the words written in the book so far.
This means that the probability for a word to be rewritten is proportional to the number of
times it has already been written. When re-creating a real book the parameter $\alpha$
($= W_D/W_T$) is usually a small number ($\sim 0.1$) and the length of the book ($T = W_T$) is
generally large ($\sim 10^5$).

We want to start by calculating how big a fraction of a book, written by the Simon model,
one has to read before having encountered half of all the words that appear only once in
the book. To do this we need to calculate the probability that a specific word which is
introduced at time $t$ is not repeated throughout the book of length $T$. At every time $t'$
the probability for this word not to be rewritten is the sum of the probabilities that another
of the words already written is rewritten, $(1-\alpha)\frac{t'-1}{t'}$, and that instead a completely
new word is written, $\alpha$. At time $t$, $t$ words have been written in total and $T - t$ words are
still to be written, so the total probability $p(t)$ becomes



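$$p(t) = \prod_{t'=t}^{T} \left[ (1-\alpha)\left(\frac{t'-1}{t'}\right) + \alpha \right] = \prod_{t'=t}^{T} (1-\alpha) \left[ 1 + \frac{\alpha}{1-\alpha} - \frac{1}{t'} \right] \qquad (7)$$

We introduce the quantity $\rho = 1 + \frac{\alpha}{1-\alpha} = \frac{1}{1-\alpha}$, take the logarithm on both
sides of Eq. (7) and get

$$\ln p(t) = \ln\left(\frac{1}{\rho}\right)^{T-t} + \sum_{t'=t}^{T} \ln\left(\rho - \frac{1}{t'}\right) \qquad (8)$$

where, for large $T$, we do not distinguish between $T - t$ and $T - t + 1$ factors.
Since $1/t' \ll 1$ (except for very small times, which include only a tiny part of the whole
text) we make a Taylor expansion around zero, approximate the sum with an integral and
get

$$\sum_{t'=t}^{T} \ln\left(\rho - \frac{1}{t'}\right) \approx \int_{t}^{T} \left( \ln\rho - \frac{1}{\rho t'} \right) dt' = \ln\rho^{T-t} + \frac{1}{\rho} \ln\left(\frac{t}{T}\right) \qquad (9)$$

Substituting Eq. (9) into Eq. (8) gives

$$\ln p(t) = \ln\left(\frac{1}{\rho}\right)^{T-t} + \ln\rho^{T-t} + \frac{1}{\rho}\ln\left(\frac{t}{T}\right) = \ln\left(\frac{t}{T}\right)^{1-\alpha} \;\Rightarrow\; p(t) = \left(\frac{t}{T}\right)^{1-\alpha} \qquad (10)$$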

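If we write a book, then $p(t)$ is the average number of $k = 1$-words one gets from the
introduction time $t$, and so

$$\sum_{t=1}^{T} p(t) = W_D(1) \qquad (11)$$

where $W_D(1)$ is the total number of $k = 1$-words in the book. Approximating the sum with an
integral and substituting $t/T = x$ gives

$$W_D(1) = \sum_{t=1}^{T} \left(\frac{t}{T}\right)^{1-\alpha} \approx T \int_{1/T}^{1} x^{1-\alpha} dx = T \left[ \frac{x^{2-\alpha}}{2-\alpha} \right]_{1/T}^{1} = \frac{T}{2-\alpha}\left(1 - \frac{1}{T^{2-\alpha}}\right) \approx \frac{T}{2-\alpha} \qquad (12)$$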



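To find the time, $T_{1/2}$, when we have introduced half of all the $k = 1$-words, we solve the
expression

$$\frac{1}{2} W_D(1) = \sum_{t=1}^{T_{1/2}} \left(\frac{t}{T}\right)^{1-\alpha} \approx T \int_{1/T}^{T_{1/2}/T} x^{1-\alpha} dx = T\left[\frac{x^{2-\alpha}}{2-\alpha}\right]_{1/T}^{T_{1/2}/T} \approx \frac{T}{2-\alpha} \left(\frac{T_{1/2}}{T}\right)^{2-\alpha} \qquad (13)$$

where we have again substituted $t/T = x$. Together with Eq. (12) this gives

$$\frac{T_{1/2}}{T} = \left(\frac{1}{2}\right)^{\frac{1}{2-\alpha}} \qquad (14)$$

which is the fraction of the book one has to read before one half of the $k = 1$-words have
been read. For the Simon-book in Fig. 2 ($\alpha = 0.083$) this value is $T_{1/2}/T = 0.697$. That is,
69.7% of the book.

Equation (14) can be generalized into

$$\frac{T_{\eta}}{T} = \eta^{\frac{1}{2-\alpha}} \qquad (15)$$

where $\eta$ is the fraction of one-degree words (words with $k = 1$) that have been read.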



Next we want to do the same thing for $k = 2$-words. Now we need to calculate the
probability that a word first introduced at time $t_1$ is repeated only once, at time
$t_2$. This probability is given by

$$p(t_1, t_2) = \prod_{t'=t_1}^{t_2} \frac{1}{\rho}\left(\rho - \frac{1}{t'}\right) \left(\frac{1}{t_2}\right) \prod_{t'=t_2}^{T} \frac{1}{\rho}\left(\rho - \frac{2}{t'}\right) \qquad (16)$$

where the 2 in the last product comes from now having two words with the possibility of
being picked.

This equation can be evaluated in a similar way as for the $k = 1$-case, and we get

$$p(t_1, t_2) = T^{2(\alpha-1)}\, t_1^{1-\alpha}\, t_2^{-\alpha} \qquad (17)$$

Again, this quantity gives the average number of $k = 2$-words one will get from words
that are introduced at time $t_1$ and repeated at time $t_2$, which means that

$$\sum_{t_1=1}^{T} \sum_{t_2=t_1}^{T} p(t_1, t_2) = W_D(2) \qquad (18)$$

where we sum over all possible combinations of $t_1$ and $t_2$ where $t_2 > t_1$. This can also be
evaluated in a similar way as for the $k = 1$-case and we get

$$\frac{W_D(2)}{T} = \frac{1}{1-\alpha}\left(\frac{1}{2-\alpha} - \frac{1}{3-2\alpha}\right) \qquad (19)$$

The total number of words in a k-group (all the distinct words with frequency k) is
$kW_D(k)$. The time $T_{1/2}$, when we have read half of all these words, is given by the expression

$$2\sum_{t_1=1}^{T_{1/2}}\sum_{t_2=t_1}^{T_{1/2}} p(t_1,t_2) + \sum_{t_1=1}^{T_{1/2}}\sum_{t_2=T_{1/2}}^{T} p(t_1,t_2) = \frac{2W_D(2)}{2} \qquad (20)$$
The first sum counts all the words for which both appearances happen before $T_{1/2}$; these are
thus counted twice. The second sum counts all the words that were introduced before $T_{1/2}$
and repeated after $T_{1/2}$; these are thus counted once. Equation [20] can be evaluated into:

$$\frac{\left(\frac{T_{1/2}}{T}\right)^{3-2\alpha}\left(\frac{1}{2-\alpha}-\frac{2}{3-2\alpha}\right)+\frac{1}{2-\alpha}\left(\frac{T_{1/2}}{T}\right)^{2-\alpha}}{\frac{1}{2-\alpha}-\frac{1}{3-2\alpha}} = 1 \qquad (21)$$

Equation [21] cannot be solved analytically but a numerical solution for the Simon-book in
Fig. 2 ($\alpha = 0.083$) gives the value $T_{1/2}/T \approx 0.638$.
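One way to obtain such a numerical solution is plain bisection on $y = T_{1/2}/T$. The sketch below assumes Eq. (21) has the form $\left[y^{3-2\alpha}\left(\tfrac{1}{2-\alpha}-\tfrac{2}{3-2\alpha}\right)+\tfrac{1}{2-\alpha}y^{2-\alpha}\right]/\left(\tfrac{1}{2-\alpha}-\tfrac{1}{3-2\alpha}\right)=1$; the function names are illustrative, and the paper does not state which solver was actually used.

```python
def eq21_residual(y, alpha):
    """Left-hand side of Eq. (21) minus 1, with y = T_half / T."""
    a = 1.0 / (2.0 - alpha)
    b = 1.0 / (3.0 - 2.0 * alpha)
    lhs = y ** (3.0 - 2.0 * alpha) * (a - 2.0 * b) + a * y ** (2.0 - alpha)
    return lhs / (a - b) - 1.0


def solve_half_point_k2(alpha, tol=1e-12):
    """Bisection for the root of Eq. (21) in (0, 1), valid for 0 <= alpha < 1.

    The residual is -1 at y = 0 and +1 at y = 1, and it increases
    monotonically in between, so the bracket always contains the root.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if eq21_residual(mid, alpha) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    print(solve_half_point_k2(0.083))
```

For $\alpha = 0.083$ the root agrees with the quoted value of about 0.638 to three decimals.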

We now have two points (k = 1 and k = 2) giving the asymptotic functional form for low
values of k. In Fig. 2b a straight line was drawn through these two points
($T_{1/2}/T|_{k=1} = 0.697$ and $T_{1/2}/T|_{k=2} = 0.638$) to show this asymptotic behavior.
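The two low-k values can also be compared with a direct simulation. The sketch below is a minimal Simon-type growth model under stated assumptions: at each step a new word is introduced with probability $\alpha$, otherwise a uniformly chosen earlier token is copied. The functions `simon_book` and `half_point` are hypothetical helpers, not the authors' code, and a single run fluctuates around the analytic values.

```python
import random
from collections import Counter


def simon_book(T, alpha, seed=0):
    """Generate a text of T tokens (integer word ids) with a Simon-type rule."""
    rng = random.Random(seed)
    text = [0]          # the first token is necessarily a new word
    next_word = 1
    while len(text) < T:
        if rng.random() < alpha:
            text.append(next_word)          # introduce a new word
            next_word += 1
        else:
            text.append(rng.choice(text))   # copy a random earlier token
    return text


def half_point(text, k):
    """Fraction of the text read when half of all tokens belonging to
    words of final frequency k have been encountered."""
    counts = Counter(text)
    positions = [i + 1 for i, w in enumerate(text) if counts[w] == k]
    if not positions:
        raise ValueError("no words with frequency k")
    half = (len(positions) + 1) // 2        # positions are in reading order
    return positions[half - 1] / len(text)


if __name__ == "__main__":
    book = simon_book(50_000, 0.083, seed=1)
    print(half_point(book, 1), half_point(book, 2))
```

The text length of 50,000 tokens is an arbitrary choice for the illustration; it is not the length of the Simon-book used in Fig. 2.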

The derivation of this quantity gets very complicated for larger values of k since we
are summing over all different words with the same frequency. But for very large k we
have words that are alone in their frequency-group. That is, they are the only ones with
that particular frequency. This makes the derivation much simpler and we can get the
asymptotic behavior for large k. From Ref. [3] we get the equation

$$k(t) = \left(\frac{T}{t}\right)^{1-\alpha} \qquad (22)$$

where k(t) is the number of occurrences a word will have in a book of length T if it was 
introduced at time t. We want to know at what time we have written half of those words. 
This is given by 



$$k_{1/2}(t) = \frac{k(t)}{2} = \left(\frac{T_{1/2}}{t}\right)^{1-\alpha}$$

$$\frac{k(t)}{k_{1/2}(t)} = 2 = \frac{(T/t)^{1-\alpha}}{(T_{1/2}/t)^{1-\alpha}} = \left(\frac{T}{T_{1/2}}\right)^{1-\alpha} \;\Rightarrow\; \frac{T_{1/2}}{T} = 2^{-\frac{1}{1-\alpha}} \qquad (23)$$

This equation holds for all k-values where $W_D(k) = 1$. For the Simon-book in Fig. 2
($\alpha = 0.083$) this value is $T_{1/2}/T \approx 0.47$ and represents the horizontal line in Fig. 2b.
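The large-k asymptote is a one-line evaluation. The sketch below assumes Eq. (23) reads $T_{1/2}/T = 2^{-1/(1-\alpha)}$; the function name is illustrative.

```python
def half_point_large_k(alpha):
    """Large-k asymptote T_half / T = 2 ** (-1 / (1 - alpha)), for 0 <= alpha < 1."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    return 2.0 ** (-1.0 / (1.0 - alpha))


if __name__ == "__main__":
    # alpha = 0.083 is the value quoted for the Simon-book in Fig. 2
    print(half_point_large_k(0.083))
```

In the limit $\alpha \to 0$ the expression tends to 1/2, and it decreases as $\alpha$ grows, so under this formula the most frequent words of a Simon-book reach their half-point before the middle of the text.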
FIGURE CAPTIONS



Fig 1: Word frequency distribution P(k) for the book Howards End (HE): a) Circles
give the raw data. The horizontal tail reflects that the largest numbers of occurrences
correspond to single words. Triangles give log-binned data and follow a smooth curve,
implying a stochastic origin. The actual data is to good approximation of the form
$P(k) \sim \exp(-bk)/k^{\gamma}$ with $\gamma = 1.73$. b) P(k) changes with the section size of the book.
The full curve represents the complete HE; the long-dashed and short-dashed curves represent
sections corresponding to a 20th and a 200th part of HE, respectively. The curves represent
the log-binned data.

Fig 2: Number of distinct words $w_D(w_T)$ as a function of the total number of words
$w_T$. a) Real and randomized HE given by full and dashed curve, respectively. The close
agreement implies that the words are close to randomly distributed throughout the book. b)
Curves describing how big a fraction of the book one has to read before having encountered
half of all the words with a specific frequency. The circles and triangles represent the
real HE and a Simon-book (same size and $\langle k \rangle$ as HE), respectively. The dashed lines
show the analytic asymptotic behavior of the Simon-book (see Appendix B). The full
line represents the average result for a randomized book and the gray areas show one
and two standard deviations away from the random book. c) $w_D(w_T)$ for three different
starting points within the book; full, long-dashed and short-dashed curves correspond to
the beginning, middle and end of HE, respectively. The close agreement implies that the
word distribution in a book is to good approximation translationally invariant. d) The same
starting points as in c), assuming that the word distribution was given by the
Simon text growth model. The large and systematic differences show that Simon-type
growth models do not describe the randomness of the word distribution in a real text.

Fig 3: The sectioning of two full novels compared to a short story by the same author.
a) The circles represent the binned data of the full novel Women in Love. The triangles
show the sectioning (a 20th-part) of the same book down to the same size as the short story
The Prussian Officer, shown with squares. b) The same as for a) but for the full novel Sons
and Lovers sectioned into an 18th-part.

Fig 4: The random book transformation (RBT). a) The data for HE (open circles) is
parametrized (dashed curve). The dashed curve is transformed to a 200th-part of the book
(full curve). This full curve should correspond to a 200th-part of the randomized HE (open
triangles). The agreement is striking. The distribution corresponding to a 200th part of
the real HE is given by the open squares. The close agreement with the triangles shows
that the words are to a large extent randomly distributed. b) The function $w_D(w_T)$ for HE:
The full curve corresponds to the randomized HE and the circles are obtained from the RBT
using the parametrization of P(k) given in a). The agreement is perfect. The long-dashed
curve corresponds to the data obtained from the RBT using the parametrization of P(k) given
in Fig. 1a, and the inset, which is a magnified view of the dashed square, shows
how this curve deviates from the real data. The short-dashed curve in b) represents a
power-law fit to the word-frequency distribution, which clearly fails to represent the data.
c) shows how similar the two parametrizations are, which means that the RBT determines
P(k) to high accuracy.

Fig A1: Complementary figure to Fig. 2a and c showing the number of distinct words
$w_D(w_T)$ as a function of the total number of words $w_T$ for seven additional books. The first
column represents counting from start to finish and the second column represents counting
through three consecutive parts of the same size.